=======================================================

Scott Burns

Udacity – Data Analyst Nanodegree

Project 3: Data Analysis with R

========================================================

Introduction and Background

Why I chose this dataset

I live in Oakland, California, and plan to stay for the foreseeable future. I’m very curious about crime trends in the city, as they may affect my life and my decisions about real estate and more. These topics are also interesting given the popular perception of Oakland as a crime-ridden, dangerous place to live. While exploring dataset options for this project, I came across this dataset on OpenOakland.org, which includes all crime reports from the city of Oakland from 2007 through earlier this summer, with incident details such as geographic location and type of crime. It looked fascinating to me, and I decided to dive in.

Questions I’d love to explore

  1. How has the incidence of crime been trending in the city over the last 8 years?
  2. Have particular types of crime grown or diminished at different rates over the last few years?
  3. Where is crime occurring geographically?
  4. How has crime incidence changed in particular areas within the city?

Most of all, I’d like to create a few thought-provoking visualizations and see if they might suggest directions for more in-depth exploration.

About the dataset:

The data were downloaded at data.openoakland.org

Additional background on the dataset is available on Rik Belew’s blog

Background on the dataset’s CrimeCat classifications can be found on this explainer page

More detailed background on the dataset is available in Dataset Description section at the end of this file.


Stream-of-consciousness exploration

Getting started - preparing the RStudio environment

In preparation for analysis, I loaded in the dataset of interest, and glanced at summary information about it (output suppressed).

I also created two variables to potentially use throughout subsequent plots.

#1. Create dummy variable for crime - any_crime - any record where `Desc` and `CrimeCat` are not blank is assigned a value of 1
#If a record doesn't even have this minimal information about the crime, I think it's better to exclude it as a criminal incident
crimes$any_crime <- with(crimes, (Desc != "" & CrimeCat != "")) * 1
#head(crimes)

#2. Add column date_format with date representation of `Date` string
crimes$date_format <- as.Date(crimes$Date, format = "%m/%d/%y")

Initial EDA: Rough plots of high-level crime data

As a first basic attempt to visualize the data, I plotted a histogram of crime reports, using a binwidth of 30 days.

ggplot(crimes, aes(x = date_format)) +
  geom_histogram(binwidth = 30)

#From the histogram, we see indications that crime reports have fallen significantly in Oakland over the last 7 years, as report counts per 30-day period in 2007 and 2008 appear to number between 9,000 and 10,000, while counts per period have been around 4,000 in recent years

For another view of Oakland crime dynamics, I created a scatterplot of incidents per day, using the dplyr aggregation techniques we learned in the Data Analysis in R course.

#Create new dataframe based on incidents per day
#This uses dplyr 'verbose' method
crime_day <- group_by(crimes, date_format) 
crime_by_day <- summarize(crime_day, 
                          incidents = sum(any_crime),
                          n=n()) 
#head(crime_by_day) #To verify dataframe output

#Basic scatterplot by day
ggplot(aes(x=date_format, y=incidents), data=crime_by_day) +
  geom_point() +
  geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

In the daily scatterplot, the decline of crime reports over time is clearly visible, highlighted by the down-trending geom_smooth line.

As daily crime incident counts are fairly noisy, I wanted to look at potentially smoother time increments, creating a scatterplot of any_crime incidents per month.

#To create the monthly scatterplot, I cut the date_format data into monthly units, using the syntax we learned in lesson 5 of the Data Analysis in R course
crimes$month <- as.Date(cut(crimes$date_format,
  breaks = "month"))
#head(crimes$month) #Shown to confirm new column has correct values

#As part of the plotting exercise, I first needed to aggregate the crime data by month
#To do so, I applied dplyr methods again, but this time using the 'concise' syntax
crime_by_month <- crimes %>%
  group_by(month) %>%
  summarize(incidents = sum(any_crime),
            n = n()) %>%
  arrange(month)

ggplot(aes(x=month, y=incidents), data=crime_by_month) +
  geom_point() +
  geom_smooth(aes(group = 1))
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

The downward trend in crime reports is again evident in plotted monthly crime report data. Another striking feature of the chart is the large discontinuity in incident count at the beginning of 2014.

It seems very strange to me that the reported number of crimes would be fairly constant throughout 2012 and 2013, fall by around 40% as the year changed, and then persist at a largely constant lower level of around 3,700 incidents per month for the next year. I wonder whether there was a major change in the way crimes were recorded or reported starting at the beginning of 2014.
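As a quick sanity check on the size of this drop, the mean monthly incident counts for 2013 and 2014 can be compared directly (a sketch, assuming the crime_by_month dataframe built above):

```r
#Rough check of the 2014 discontinuity: mean monthly incidents, 2013 vs 2014
#(assumes the crime_by_month dataframe created above)
library(dplyr)

crime_by_month %>%
  mutate(year = format(month, "%Y")) %>%
  filter(year %in% c("2013", "2014")) %>%
  group_by(year) %>%
  summarize(mean_incidents = mean(incidents))
```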

Initial EDA: Rough plots with incidents segmented by crime type

After having uncovered some interesting insights about local crime trends at an aggregate level, I wanted to drill down into crime segments, using some of the multivariate visualization techniques we covered in the later lessons of Data Analysis in R.

To understand how I could best group crime report incidents by description, I surveyed all the potential values appearing in the columns CrimeCat and Desc.

#unique(crimes$Desc) #I've suppressed the output for this column because there are about 1800 unique descriptions in this field. This would not be a useful column to use in grouping for plots.

unique(crimes$CrimeCat) #The CrimeCat column includes about 60 unique categories. This number is too large for tractable visualizations, but I decided I could group these into a smaller number of main categories (variable - mainCat) using grepl string matching
##  [1]                                   LARCENY_BURGLARY_AUTO            
##  [3] LARCENY_FORGERY-COUNTERFEIT       OTHER                            
##  [5] ASSAULT_MISDEMEANOR               COURT_MISDEMEANOR                
##  [7] QUALITY_DRUG                      TRAFFIC_DUI                      
##  [9] COURT_WARRANT                     TRAFFIC_TOWED-VEHICLE            
## [11] DOM-VIOL                          LARCENY_THEFT_VEHICLE_AUTO       
## [13] OTHER_RECOVERED                   VANDALISM                        
## [15] TRAFFIC_MISDEMEANOR               QUALITY_DRUG_POSSESSION_MARIJUANA
## [17] QUALITY_DRUG_POSSESSION           ASSAULT_SHOOTING                 
## [19] WEAPONS                           ASSAULT_BATTERY                  
## [21] ASSAULT                           ASSAULT_FIREARM                  
## [23] ASSAULT_OTHER-WEAPON              KIDNAPPING                       
## [25] DOM-VIOL_BATTERY-SPOUSE           QUALITY_DRUG_SALE-MFCTR          
## [27] ROBBERY_FIREARM                   ASSAULT_THREATS                  
## [29] ASSAULT_KNIFE                     LARCENY_THEFT_PETTY              
## [31] QUALITY_LIQUOR                    LARCENY_THEFT                    
## [33] ROBBERY_STRONG-ARM                LARCENY_THEFT_VEHICLE_OTHER      
## [35] ROBBERY                           HOMICIDE                         
## [37] LARCENY_RECEIVED                  LARCENY_POSSESSION               
## [39] LARCENY_BURGLARY_OTHER            LARCENY_BURGLARY_RESIDENTIAL     
## [41] SEX_RAPE                          OTHER_MISSING-PERSON             
## [43] SEX_OTHER                         LARCENY_BURGLARY_COMMERCIAL      
## [45] LARCENY_FRAUD                     ROBBERY_KNIFE                    
## [47] LARCENY_THEFT_GRAND               ARSON                            
## [49] ASSAULT_ATTEMPTED                 QUALITY_CURFEW-LOITERING         
## [51] ASSAULT_PEACE-OFFICER             ROBBERY_OTHER-WEAPON             
## [53] QUALITY_DISORDERLY-CONDUCT        ROBBERY_INHABITED-DWELLING       
## [55] SEX_PROSTITUTION                  LARCENY_THEFT_VEHICLE_CAR-JACKING
## [57] DOM-VIOL_CHILD                    LARCENY_ATTEMPTED                
## [59] OTHER_RUNAWAY                     MENTAL-ILLNESS                   
## 60 Levels:  ARSON ASSAULT ASSAULT_ATTEMPTED ... WEAPONS
#Thus I created a mainCat variable, grouping crimes as Homicide, Robbery/Larceny, Assault, Rape, Weapons, Domestic Violence, Traffic and Court violations, and 'Quality' (of life; a class of crimes noted in the dataset that covers somewhat minor violations such as 'Curfew-Loitering' and drug possession)

crimes <- transform(crimes, mainCat = 
                        ifelse(grepl("HOMICIDE", CrimeCat, ignore.case = T),"homicide",
                        ifelse(grepl("ROBBERY|LARCENY", CrimeCat, ignore.case = T),"robbery",
                        ifelse(grepl("ASSAULT", CrimeCat, ignore.case = T),"assault",
                        ifelse(grepl("WEAPONS", CrimeCat, ignore.case = T),"weapons",
                        ifelse(grepl("DOM-VIOL", CrimeCat, ignore.case = T),"domestic_violence",
                        ifelse(grepl("RAPE", CrimeCat, ignore.case = T),"rape",
                        ifelse(grepl("TRAFFIC", CrimeCat, ignore.case = T),"traffic",
                        ifelse(grepl("COURT", CrimeCat, ignore.case = T),"court",       
                        ifelse(grepl("QUALITY|VANDALISM", CrimeCat, ignore.case = T),"quality","other"))))))))))

#I then surveyed examples from the output to ensure the new column assignment worked correctly
#head(crimes,50) #suppressed for the report
#tail(crimes,50)
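As an aside, more recent versions of dplyr offer case_when, which expresses the same grepl cascade without deep ifelse nesting. A sketch of the equivalent assignment (same matching logic as above):

```r
#Alternative to the nested ifelse chain: dplyr's case_when reads top to bottom,
#with the first matching condition winning, just like the grepl cascade above
library(dplyr)

crimes <- crimes %>%
  mutate(mainCat = case_when(
    grepl("HOMICIDE", CrimeCat, ignore.case = TRUE) ~ "homicide",
    grepl("ROBBERY|LARCENY", CrimeCat, ignore.case = TRUE) ~ "robbery",
    grepl("ASSAULT", CrimeCat, ignore.case = TRUE) ~ "assault",
    grepl("WEAPONS", CrimeCat, ignore.case = TRUE) ~ "weapons",
    grepl("DOM-VIOL", CrimeCat, ignore.case = TRUE) ~ "domestic_violence",
    grepl("RAPE", CrimeCat, ignore.case = TRUE) ~ "rape",
    grepl("TRAFFIC", CrimeCat, ignore.case = TRUE) ~ "traffic",
    grepl("COURT", CrimeCat, ignore.case = TRUE) ~ "court",
    grepl("QUALITY|VANDALISM", CrimeCat, ignore.case = TRUE) ~ "quality",
    TRUE ~ "other"
  ))
```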

#As in previous examples, I created a new dataframe with dplyr functions, this time grouping by both date and the new mainCat variable
#Also, note that I saw several report data points for dates in the future - I assume these were the result of initial entry errors. For all plots going forward, I have chosen to subset to incidents occurring prior to a 'cut-off' date of June 2015.

cutOffDate = '2015-06-01'
crime_types_by_day <- subset(crimes, date_format < cutOffDate) %>%
  group_by(date_format, mainCat) %>%
  summarize(incidents = sum(any_crime),
            n = n()) %>%
            ungroup() %>%
  arrange(date_format,incidents)

#Following is the scatterplot by day of incidents, colored by mainCat variable
ggplot(aes(x= date_format, y=incidents), data=crime_types_by_day) +
  geom_point(aes(color = mainCat), alpha = 0.3) +
  geom_smooth(aes(color = mainCat))
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

#The results are fairly messy with daily measures, so I decided to create a similar scatterplot with mainCat groupings, but using monthly crime incident counts

#For this I built a new crime_types_by_month dataframe...
crime_types_by_month <- subset(crimes, date_format < cutOffDate) %>%
  group_by(month, mainCat) %>%
  summarize(incidents = sum(any_crime),
            n = n()) %>%
            ungroup() %>%
  arrange(month,incidents)

#...then plotted it
ggplot(aes(x= month, y=incidents), data=crime_types_by_month) +
  geom_jitter(aes(color = mainCat), alpha = 0.6) +
  geom_smooth(aes(color = mainCat))
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

#Many interesting trends are visible when grouping incidents by type in a single plot, but the output is still fairly messy and dynamics for some categories are hard to discern, as the scales of total incidents in each crime category are substantially different

#Thus, I decided to re-plot crime_types_by_month in a facet wrap with 'free_y' scale to better view dynamics by crime type
ggplot(aes(x= month, y=incidents), data=crime_types_by_month) +
  geom_point(aes(color = mainCat)) +
  facet_wrap(~ mainCat, scales='free_y') +
  geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

I found this chart to be very striking, as incidents for all the main crime categories appear to have dropped significantly, while each shows a different pattern of decline. Some categories, like Assault and Robbery, plummeted around 2010 and then remained steady, while others - like Homicide, Traffic, Domestic Violence and Other - showed big drops later, around 2013 and 2014.

The strange discontinuity at the 2014 year mark I noted earlier is also present in these (the later declining) categories, with incident counts holding steady in 2014 and 2015 after falling massively from much higher levels immediately before 2013 year-end.

Initial EDA: Rough plots with incidents segmented by location

Besides tracking crime dynamics by type, I also wanted to explore how crimes were distributed geographically. To do so, I used the Beat variable, which indicates in which of a few dozen zones within Oakland the crime occurred or was recorded.

Similar to the previous exploration, I built a new dataframe, grouping on incidents per month and Beat

crime_by_month_and_beat <- subset(crimes, date_format < cutOffDate) %>%
  group_by(month, Beat) %>%
  summarize(incidents = sum(any_crime),
            n = n()) %>%
            ungroup() %>%
  arrange(month)

#I plotted month and beat dynamics on one chart, associating each beat with a color.
#Note: suppressed as result wasn't useful
#ggplot(aes(x= month, y=incidents), data=crime_by_month_and_beat) +
#  geom_jitter(aes(color = Beat), alpha = 0.6) +
#  geom_smooth(aes(color = Beat))

#This scatterplot looked messy (with no obvious insights), so I thought perhaps a stacked bar chart of incidents by beat might show more insight.

#I also tried a stacked bar chart of monthly incidents, colored by police beat.
#Note: suppressed as result wasn't useful
#ggplot(crime_by_month_and_beat, aes(x = month, fill=Beat)) +
#  geom_bar(binwidth = 30)

#The bar plot didn't provide valuable insight and didn't show the trends visible in other graphs, so I decided this approach should be discarded

Going forward, I decided that there are too many police beats to display effectively in a chart, so I took the top 20 beats and grouped incidents in all other beats under ‘Other’. As you’ll see in the calculations below, these top 20 beats cover about 52% of all crime reports with descriptions (our any_crime variable).

As another note - I was also relieved to see that the Beat covering my home was not in the top 20 based on total incident count.

#First I created a dataframe with total crime incidents per beat, aggregated across all dates before the cut-off
total_crime_by_beat <- subset(crimes, date_format < cutOffDate) %>%
  group_by(Beat) %>%
  summarize(incidents = sum(any_crime),
            n = n()) %>%
  arrange(desc(incidents))
head(total_crime_by_beat) #To check the output
## Source: local data frame [6 x 3]
## 
##     Beat incidents     n
##   (fctr)     (dbl) (int)
## 1    08X     24065 25155
## 2    04X     23466 24726
## 3    34X     22432 23628
## 4    06X     19954 20929
## 5    30X     19241 20378
## 6    19X     18968 19958
sum(total_crime_by_beat$incidents[1:20]) / sum(total_crime_by_beat$incidents)
## [1] 0.5221023
#Note: My beat
#Suppressed:
#myBeat <- subset(total_crime_by_beat,Beat == '22X')
#which(grepl('22X',total_crime_by_beat$Beat))

#Here I added another column to crimes - mainBeat - preserving the value for the top 20 beats by incidents, and labeling all other beats 'other'
topBeats = total_crime_by_beat$Beat[1:20]
crimes <- transform(crimes, mainBeat = ifelse((Beat %in% topBeats), as.character(Beat), 'other'))

#Then I revised the crime_by_month_and_beat again with the mainBeat variable to produce clearer output...
crime_by_month_and_beat <- subset(crimes, date_format < cutOffDate) %>%
  group_by(month, mainBeat) %>%
  summarize(incidents = sum(any_crime),
            n = n()) %>%
            ungroup() %>%
  arrange(incidents)

#unique(crimes$mainBeat)
#head(crimes)

#...re-plotting with facet wrap to more clearly see the crime incident dynamic by beat
ggplot(aes(x= month, y=incidents), data=crime_by_month_and_beat) +
  geom_point(aes(color = mainBeat)) +
  facet_wrap(~ mainBeat, scales='free_y') +
  geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Again, in the per-beat facet breakdown we see uniformly down-trending crime incidence over time, with some beats showing more pronounced local spikes from mid- to late 2013. For some beats, including 04X, 08X, 20X and ‘other’, we also see the strange discontinuity at the end of 2013 that appeared in other views of monthly crime over time.

For these police beats, crime holds steady or spikes toward the end of 2013, then drops precipitously right at the new year, and holds at or declines from the lower level to the present day.


Final Plots and Summary

For my final overview plots I chose to sharpen and adjust a few views we looked at in the Initial EDA section, and to refine the associated fitted curves with linear models for each plot, instead of using the standard non-parametric geom_smooth function.

Final Plot 1: Aggregate view of weekly crime incidence in dataset

For my final aggregate chart, I also wanted to set up a simple model for predicting weekly crime incidence in future periods based on time elapsed.

#For the aggregate crime trends chart, I decided to use weekly crime incident data, as this could provide a balance between the bias and variance poles of the daily and monthly periods from earlier plots. I also plotted the log of y and overlaid a linear model fit, planning to run a corresponding regression to arrive at usefully interpretable coefficients.

#First: weekly analysis requires weekly units, cutting date_format accordingly
crimes$week <- as.Date(cut(crimes$date_format,
  breaks = "week"))

#Then: building dataframe with weekly crime values
crime_by_week <- subset(crimes, date_format < cutOffDate) %>%
  group_by(week) %>%
  summarize(incidents = sum(any_crime),
            n = n()) %>%
  arrange(week)
#head(crime_by_week) #To verify dataframe output

#Aggregate crime incidence scatterplot by week
ggplot(aes(x=week, y=log(incidents)), data=crime_by_week) +
  geom_point(color = 'darkblue', alpha = 0.8) +
  ylab("Log of Crimes Reported Per Week") +
  xlab("Date") +
  scale_y_continuous(breaks = seq(6,8,0.25)) +
  stat_smooth(method = 'lm', formula = I(y)~I(as.numeric(x)),  color = 'red') +
  ggtitle("Weekly Crime Incidence in Oakland:\n January 2007 to June 2015") +
  theme(plot.title = element_text(lineheight=.8, face="bold"))

#Out of curiosity I also looked at the summary stats for the regression line, and note that the fit is very strong, with an F-test p-value close to zero and an R-squared of around 0.82
c1 <- lm(I(log(incidents)) ~ I(as.numeric(week)), data = crime_by_week)
summary(c1)
## 
## Call:
## lm(formula = I(log(incidents)) ~ I(as.numeric(week)), data = crime_by_week)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38938 -0.08481 -0.01057  0.08579  0.32571 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.188e+01  1.037e-01  114.57   <2e-16 ***
## I(as.numeric(week)) -3.064e-04  6.878e-06  -44.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1278 on 437 degrees of freedom
## Multiple R-squared:  0.8195, Adjusted R-squared:  0.8191 
## F-statistic:  1984 on 1 and 437 DF,  p-value: < 2.2e-16
#Based on the coefficient here, and the way as.numeric converts week variables, my understanding is that the model projects crime falling by about 11% per year going forward.
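To make the roughly 11% figure concrete: as.numeric on a Date counts days, so the weekly model’s slope is a per-day rate, and the implied annual change can be computed directly from the fitted coefficient above (a small arithmetic check):

```r
#The slope from the weekly model is per day, since as.numeric(Date) counts days
beta <- -3.064e-04                       #fitted coefficient from summary(c1)
annual_change <- exp(beta * 365.25) - 1  #compound change over one year
annual_change                            #about -0.106, i.e. roughly an 11% annual decline
```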

Final Plot 2: View of monthly crime, faceted by crime type

In my second chart, I wanted to refine the faceted-by-crime-type charts I produced in the exploratory analysis, but use corresponding regression lines based on linear models for each facet as in Final Plot 1. For the facets, I decided to draw on monthly crime statistics, as weekly figures in certain categories were very sparse with too much variance, and I left the y-scale free.

#Scatterplot of crime data by month, faceted by crime type
ggplot(aes(x=month, y=log(incidents)), data=crime_types_by_month) +
  geom_point(aes(color = mainCat), alpha = 0.9) +
  ylab("Log of Crimes Reported Per Month") +
  xlab("Date") +
  facet_wrap(~ mainCat, scales='free_y', ncol=5) +
  stat_smooth(method = 'lm', formula = I(y)~I(as.numeric(x)),  color = 'red') +
  ggtitle("Monthly Crime Incidence in Oakland by Crime Category:\n January 2007 to June 2015") +
  theme(plot.title = element_text(lineheight=.8, face="bold"))

#Note that the discontinuities in crime reduction around the beginning of 2014 are even more striking when plotting the log of incidents on the y axis

#To test how the inclusion of crime type variables could improve our model accuracy, I decided to include mainCat as an independent variable in a linear regression model - the addition seems to improve fit slightly versus a model of incidents vs time alone. R-squared for the new model is about 0.85
cm1 <- lm(I(log(incidents)) ~ I(as.numeric(month)), data = crime_types_by_month)
cm2 <- update(cm1, ~ . + mainCat)
summary(cm2)
## 
## Call:
## lm(formula = I(log(incidents)) ~ I(as.numeric(month)) + mainCat, 
##     data = crime_types_by_month)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8406 -0.2704  0.0823  0.3921  1.4612 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.530e+01  3.563e-01  42.949   <2e-16 ***
## I(as.numeric(month))     -5.852e-04  2.331e-05 -25.103   <2e-16 ***
## mainCatcourt             -2.004e+00  9.156e-02 -21.890   <2e-16 ***
## mainCatdomestic_violence -1.240e-01  9.038e-02  -1.371   0.1705    
## mainCathomicide          -2.908e+00  9.038e-02 -32.176   <2e-16 ***
## mainCatother              1.522e-01  9.038e-02   1.684   0.0925 .  
## mainCatquality            1.551e-01  9.038e-02   1.716   0.0865 .  
## mainCatrape              -3.451e+00  9.407e-02 -36.688   <2e-16 ***
## mainCatrobbery            1.377e+00  9.038e-02  15.234   <2e-16 ***
## mainCattraffic           -1.006e+00  9.038e-02 -11.128   <2e-16 ***
## mainCatweapons           -2.131e+00  9.038e-02 -23.581   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6423 on 980 degrees of freedom
## Multiple R-squared:  0.8523, Adjusted R-squared:  0.8508 
## F-statistic: 565.5 on 10 and 980 DF,  p-value: < 2.2e-16

Final Plot 3: View of monthly crime, faceted by police beat

#Scatterplot of crime data by month, faceted by police beat
ggplot(aes(x=month, y=log(incidents)), data=crime_by_month_and_beat) +
  geom_point(aes(color = mainBeat), alpha = 0.9) +
  ylab("Log of Crimes Reported Per Month") +
  xlab("Date") +
  facet_wrap(~mainBeat, scales='free_y', ncol=3) +
  stat_smooth(method = 'lm', formula = I(y)~I(as.numeric(x)),  color = 'red') +
  ggtitle("Monthly Crime Incidence in Oakland by Police Beat:\n January 2007 to June 2015") +
  theme(plot.title = element_text(lineheight=.8, face="bold"))

#To test how police beat data could improve our model accuracy, I decided to include mainBeat as an independent variable in a linear regression model with monthly data. The addition seems to improve fit - the new R-squared is about 0.93
cmb1 <- lm(I(log(incidents)) ~ I(as.numeric(month)), data = crime_by_month_and_beat)
cmb2 <- update(cmb1, ~ . + mainBeat)
summary(cmb2)
## 
## Call:
## lm(formula = I(log(incidents)) ~ I(as.numeric(month)) + mainBeat, 
##     data = crime_by_month_and_beat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68932 -0.12451 -0.00748  0.12297  0.66423 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.029e+01  7.442e-02 138.322  < 2e-16 ***
## I(as.numeric(month)) -3.236e-04  4.778e-06 -67.723  < 2e-16 ***
## mainBeat06X          -2.474e-01  2.748e-02  -9.004  < 2e-16 ***
## mainBeat07X          -3.771e-01  2.748e-02 -13.722  < 2e-16 ***
## mainBeat08X           1.770e-02  2.748e-02   0.644     0.52    
## mainBeat19X          -2.319e-01  2.748e-02  -8.440  < 2e-16 ***
## mainBeat20X          -4.825e-01  2.748e-02 -17.559  < 2e-16 ***
## mainBeat21Y          -6.684e-01  2.748e-02 -24.324  < 2e-16 ***
## mainBeat23X          -3.089e-01  2.748e-02 -11.240  < 2e-16 ***
## mainBeat25X          -6.808e-01  2.748e-02 -24.774  < 2e-16 ***
## mainBeat26Y          -2.888e-01  2.748e-02 -10.509  < 2e-16 ***
## mainBeat27X          -5.470e-01  2.748e-02 -19.906  < 2e-16 ***
## mainBeat27Y          -3.194e-01  2.748e-02 -11.625  < 2e-16 ***
## mainBeat29X          -4.115e-01  2.748e-02 -14.974  < 2e-16 ***
## mainBeat30X          -2.289e-01  2.748e-02  -8.329  < 2e-16 ***
## mainBeat30Y          -3.359e-01  2.748e-02 -12.225  < 2e-16 ***
## mainBeat31Y          -6.490e-01  2.748e-02 -23.617  < 2e-16 ***
## mainBeat32X          -5.873e-01  2.748e-02 -21.374  < 2e-16 ***
## mainBeat33X          -4.854e-01  2.748e-02 -17.665  < 2e-16 ***
## mainBeat34X          -1.217e-01  2.748e-02  -4.430  9.9e-06 ***
## mainBeat35X          -5.031e-01  2.748e-02 -18.309  < 2e-16 ***
## mainBeatother         2.574e+00  2.748e-02  93.663  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1953 on 2099 degrees of freedom
## Multiple R-squared:  0.9317, Adjusted R-squared:  0.931 
## F-statistic:  1363 on 21 and 2099 DF,  p-value: < 2.2e-16

Reflection

My biggest struggles with the dataset related to the process of grouping data into dataframes with incidents over useful time periods, and segmenting by tractable crime and police beat groups. Going through the steps required for my initial and final plots, I learned much about string matching, dataframe reshaping and subsetting in R.

Once the data was properly grouped and I could run plots, I found the output to be fascinating. Finding the dataset was a huge success for me, as it is a rich source of insight on a topic in which I have a strong interest. With the instruction in the Data Analysis in R course, I enjoyed visualizing elements of the dataset, and revealing the evidence highlighted in my Final Plots above.

The exploratory analysis here prompts me to explore a few more questions related to the crime data. I outline a few that are top of mind below:

  1. What caused the massive, sudden drop at the beginning of 2014 in reported crimes for many crime categories? The discontinuity at the change in the year is so striking, it seems that it must be due to a change in policing or reporting policy, rather than a strange and dramatic drop-off in crimes committed.
  2. What insight could we draw from plots using the Time variable (time of day)? I have strong hypotheses about how crime would vary by time of day, but I would love to test those with the data.
  3. How are crime types and police beats correlated? Are there specific regions associated with particular crimes?
  4. Also, with additional data, I’d love to explore how changes in crime rates have been associated with changes in economic outcomes, property values and educational indicators in Oakland.

In my explorations, I also came across many references to packages that might be useful for time-series analysis in the future, such as ‘zoo’. I’d like to try these out on other datasets.
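As a starting point for that, a minimal sketch of what zoo could offer here, using a centered rolling mean to smooth the weekly series (assumes the crime_by_week dataframe built in the Final Plots section):

```r
#Sketch: smooth weekly incident counts with a centered rolling mean via zoo
#(assumes the crime_by_week dataframe from the Final Plots section)
library(zoo)

weekly <- zoo(crime_by_week$incidents, order.by = crime_by_week$week)
smoothed <- rollmean(weekly, k = 13, align = "center")  #window of ~one quarter
plot(smoothed, xlab = "Date", ylab = "Rolling mean of weekly incidents")
```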


References and Sources (by topic)

Melt: http://www.r-bloggers.com/melt/

Reshape background: http://seananderson.ca/2013/10/19/reshape.html

Melting for time series: http://stackoverflow.com/questions/1181060/reshaping-time-series-data-from-wide-to-tall-format-for-plotting

Reshaping data: http://www.r-bloggers.com/reshape-and-aggregate-data-with-the-r-package-reshape2/

Creating columns with if-else statements: http://stackoverflow.com/questions/13672781/populate-a-column-using-if-statements-in-r

String matching with grepl: http://www.endmemo.com/program/R/grepl.php

Choosing between regression models: http://stats.stackexchange.com/questions/43930/choosing-between-lm-and-glm-for-a-log-transformed-response-variable

Creating graph titles with ggplot: http://www.cookbook-r.com/Graphs/Titles_(ggplot2)/

Overlaying fitted regressions in ggplot: http://stackoverflow.com/questions/1476185/how-to-overlay-a-line-for-an-lm-object-on-a-ggplot2-scatterplot and http://stackoverflow.com/questions/10528631/add-exp-power-trend-line-to-a-ggplot

Dataset Description

Crime reports from Oakland: January 2007 to July 2015

Description

A dataset containing descriptions, event timing, and geographic information from over 690,000 crime reports in Oakland from 2007 to mid-2015

Usage

Read in the dataset from csv

Format

A data frame with 696372 rows and 21 variables

Details

See more details on the dataset in a wiki page built by Rik Belew, at this link.